Over the past 30 years, Real-Time Location Systems (RTLS) have become the standard for enterprises and consumers to connect to their loved ones, meet business goals, and find their way home. One place where RTLS is gaining further application is its use in helping keep students connected across college campuses. In this paper, we present how RTLS access points within a college campus building can locate handheld connected devices (such as laptops or cell phones). Underlying this methodology is the understanding of how the signal strength of a device behaves at various distances, angles and even through walls which we demonstrate in a practical application. Integral to this is testing of several machine learning clustering models that will be used to predict and identify the location of these devices using the Sum of Squared Errors as the evaluation criteria. Using Weighted and Unweighted K-Nearest Neighbors methods, we demonstrated that you can significantly increase the predictability of device location if you are able to use a combination of “co” and “cd” access points. These findings can have practical application in optimized management of resource usage and in providing elasticity to the bandwidth an enterprise uses based on the location of devices loading an access point, which can offer cost benefits.
With the advent of Wi-Fi and local area networks, devices like Real-Time Location Systems (RTLS) can be leveraged for location positioning of an object in a specified area in real time. This is made possible by way of a continuous communication and feedback between a tag attached to the object being tracked and reader or array of readers. Examples of RTLS include Infrared, Bluetooth, Cellular, and Radio Frequency Identification (RFID). To operate RTLS, a scanning device is required to locate an object (such as a cell phone or laptop) based on the angle and coordinates of the object being tracked. You can use multiple scanning devices to triangulate the exact location of the object using the combination of angles in which the signal is triggering the respective receivers. Utilizing a series of wireless network signals in an office building, we will be able to detect the exact location of various objects in real time using different combinations of network devices. Here, we can perform an unweighted and weighted k-nearest neighbors (k-NN) analysis to predict the location of the unknown Online data using the Offline training data. To go further, we will be seeing if we can better predict the location of the online data using different combinations of the network device mac addresses available.
The code and methodology used for this evaluation comes primarily from Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving by Deborah Nolan of the University of California, Berkeley and Duncan Temple Lang of the University of California, Davis. The application of the code and methodoogy were further examined and proved to be sound by Southern Methodist University, as incorporated into the lectures given by Adjunct Lecturer Bradley Blanchard as part of the university’s Quantifying The World course.
The \(k\)-nearest neighbors method is a non-parametric classification method. Unlike parametric methods, like logistic regression, which learn the parameters through training, \(k\)-nearest neighbors uses a user defined tuning parameter \(k\) to determine how the model will be trained. The \(k\)-nearest neighbors model uses \(k\) neighbors to estimate the probability of each class \(g\) as the proportion of the \(k\) neighbors of \(x\) with that class, \(g\).
\[ \hat{p}_{kg}(x) = \hat{P}_k(Y = g \mid X = x) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x, \mathcal{D})} I(y_i = g) \] The unknown \(x\) is then assigned the class with the highest estimated probability. The \(k\) nearest neighbors are determined by a measure of distance like Euclidean distance or Manhattan distance. In unweighted models classification of \(x\) is determined the simple plurality of neighbors, but in weighted models the distances of the \(k\) neighbors will have a bearing on the classification of \(x\).
Using the \(k\) tuning parameter, we will estimate the online device locations by assigning them with the location of \(k\) devices with the closest signal strength. We will measure the closeness in terms of signal strength of our training observations and test prediction using a Euclidean distance metric. Once we have all these measurements, we will measure the overall performance of our model by an aggregate sum of squared errors metric.
We will test are two types of k-NN algorithms: weighted and unweighted. In the case of the unweighted, the k nearest neighbors are assigned equal weight in estimating the location of the device in question. The location of the test device is simply an arithmetic average of the (x,y coordinates) of the closest nearest neighbor device. For the weighted average algorithm, the \(k\) nearest neighbors are each assigned weights that are equal to the reciprocal of their distance to the test device in terms of signal strength. Therefore, training observations with closer signal strength distances are given greater weight in estimating the location of the test observation.
As a baseline, we will first do a search for an optimal model excluding 00:0f:a3:39:dd:cd. This will align with the original research methods of the reference case study and provide an initial set of results to compare subsequent models against.
To complete this analysis, We employed the following steps.
For this project, we will be leveraging two separate datasets for our analysis. One of which is a reference set named offline which contains signal strength measurements from a hand-held device on a gridwork of 166 different points, all of which were spaced 1 meter apart. This gridwork is located in the hallways of a one floor building at the University of Mannheim. The other dataset is titled online which we will be using for testing our k-NN model to predict the location of an unknown device. The online dataset includes 60 different locations chosen at random with 110 signals measured from each point by six 6 wireless access points. In the below, you can see a map of the online test locations (black dots) overlaid with the offline training locations (grey dots). Both datasets contain the same features and will require the same procedures for cleaning.
knitr::include_graphics("CleverZonkedElk.png")Office layout: Floor plan of the first floor of the test building in the University of Mannheim. Depicts the offline points (grey dots) placed about a meter apart from one another and the placement of the online points (black dots) scattered randomly throughout for testing. Wifi access points are denoted by blue squares.
pander::pander(
list(
t = "Time stamp (Milliseconds) since 12:00am, January 1, 1970",
Id = "router MAC address",
Pos = "Router location",
Degree = "Orientation of scanning device carried by the researcher, measured in Degrees",
MAC = "MAC address of either the router, or scanning device, combined with corresponding values for signal strength (dBm), the mode in which it was operating(adhoc scanner = 1, access router = 3), and its corresponding channel frequency.",Signal = "Received Signal Strength in DbM")
)The first cleaning method we will employ will be to break up the position variable into separate variables which we can use to triangulate the location. In our raw dataset, we have position values for latitude, longitude and elevation separated by commas, which we will convert into PosX, PosY and PosZ. Upon further cleaning, we were able to determine that there was only one unique value for PosZ at 0 (which made sense considering the experiment took place in a one story building), we had the liberty to drop the variable. Additionally, running a procedure to check on the number of unique variables in the ScanMac column yielded only a single unique value, so we can drop that one as well.
In our documentation, we found that our type of device was mixed between the values 1 and 3, which we may want to clarify more. Reviewing our documentation, we will only want to focus on fixed access points (value=3) as that is more relevant to our study of predicting device locations using a fixed set of receivers. So, moving further, we will remove the adhoc instances in our dataset.
The Time measurement is something that we will want to make an adjustment to so that we can more easily analyze in the future. As mentioned prior, the time data is based on the number of milliseconds from a specific date (which could possibly be arbitrary), so we can change to a Year-Month-Day-Time format. But first, we can divide the number of milliseconds to seconds. This leaves us with the following features that we will use across our offline and online data. Additionally, we will remove the channel feature since it is strictly a character code that contains redundant identifiers of Mac Address, signal strength, frequency and mode that may play an unfair role in our predictive modeling.
#processLine performs splitting and cleansing of delimiters in the lines
processLine = function(x) {
tokens = strsplit(x, "[;=,]")[[1]]
if (length(tokens) == 10) {
return(NULL)
}
tmp = matrix(tokens[-(1:10)], , 4, byrow = TRUE)
cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow(tmp), 6, byrow = TRUE), tmp)
}
#roundReaderOrientation adjusts angles to increments of 45 degrees
#this is done to simplify calculations overall since the resolution
#of these angles is less important than that which is provided
roundReaderOrientation = function(orientation) {
refs = seq(0, by = 45, length = 9)
angle = sapply(orientation, function(o) which.min(abs(o - refs)))
c(refs[1:8], 0)[angle]
}
# read in the data for the appropriate mac addresses
processTraceFile <- function(filename, subMacs = c("00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd",
"00:14:bf:b1:97:8a", "00:14:bf:3b:c7:c6",
"00:14:bf:b1:97:90", "00:14:bf:b1:97:8d",
"00:14:bf:b1:97:81")) {
txt = readLines(filename)
lines = txt[substr(txt, 1, 1) != "#"]
tmp = lapply(lines, processLine)
# create dataframe and name columns
data = as.data.frame(do.call(rbind, tmp), stringsAsFactors = FALSE)
names(data) = c("time", "scanMac", "posX", "posY", "posZ",
"orientation", "mac", "signal", "channel",
"type")
# keep only signals from access points (=3)
data = data[data$type == "3", ]
# drop scanMac, posZ, channel, and type - no info in them
dropVars = c("scanMac", "posZ", "channel", "type")
data = data[, !(names(data) %in% dropVars)]
# drop more unwanted access points
data = data[data$mac %in% subMacs, ]
# convert numeric values
numVars = c("time", "posX", "posY", "orientation", "signal")
data[numVars] = lapply(data[numVars], as.numeric)
# convert time to POSIX
data$rawTime = data$time
data$time = data$time/1000
class(data$time) = c("POSIXt", "POSIXct")
# round orientations to nearest 45 degree angle
data$angle = roundReaderOrientation(data$orientation)
return(data)
}offline <- processTraceFile("offline.final.trace.txt")
online <- processTraceFile("online.final.trace.txt")
#offline = readLines("http://rdatasciencecases.org/Data/offline.final.trace.txt")
#online = readLines("http://rdatasciencecases.org/Data/online.final.trace.txt")Reviewing our cleansed data set, we can see the following:
head(offline)Above is a view of our cleaned dataset. We are working with a time stamp value, an X and Y location value of our access point, the Mac address we are working with, a signal strength and the angle in which the signal was recieved.
It should be noted that the original researchers chose to exclude data from one access point. Of the 7 access points, 2 were Alpha routers. Mac 00:0f:a3:39:e1:c0 was kept for the analysis, and mac 00:0f:a3:39:dd:cd was removed. In this analysis we will endeavor to determine whether this action was warranted.
According to our documentation, we should have only 8 values for orientation. We also simplify our measurement of the reader orientation and round these measurements to the nearest 45 degree angle. The resolution of the existing data is less important, thus we can safely assume that 45 degree increments is sufficient, and this will provide some relief in terms of computational effort.
We will look at the orientation column of our dataset, we can see that we have a wide variety of angles available in clusters around the expected angles (such as 179 or 181) as shown below. Since we are going to focus on measure signal strength at 8 orientations in 45-degree increments, we will round each of our orientations to the nearest 45 degree increment. Additionally, we will try to map values close to 360 so that they line up back to zero.
length(unique(offline$orientation))#> [1] 203
plot(ecdf(offline$orientation), main = 'Orientation for the Hand-Held Device', xlab = 'CDF', ylab = 'Orientation')Orientation for the Hand-Held Device: The location of the orientation values as it relates to the empirical cdf. We can see that at each major orientation (45, 90, 135, 180, ect) are scattered around these values. So our cleaning procedure is going to round these to the nearest 45 degree angle. After making the adjustment, we can see that the new values look like they are more exact to the 8 angles we are using.
with(offline, boxplot(orientation ~ angle,
xlab = 'Rounded 45 Degree Angle',
ylab = 'Orientation'))